1.2 Introduction to Statistics

1 Statistical Models

Statistical Model

A statistical model is a family $\mathcal{P}$ of candidate distributions for the data $X$. We assume $X \sim P$ for some $P \in \mathcal{P}$, but we don't know which; $X$ yields evidence about which $P$ generated it.

1.1 Parametric vs Nonparametric Models

Parametric models are families of distributions indexed by $\theta \in \Theta$: $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}^d$ ($d$ is called the model dimension).
We write $P_\theta(\cdot)$ and $E_\theta(\cdot)$ for probability and expectation, also indexed by $\theta$.
In contrast, in a nonparametric model there is no natural way to parameterize $\mathcal{P}$ by a real vector.

A simple nonparametric model: $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P$, where $P$ is ANY distribution on $\mathbb{R}$. Then $\mathcal{P} = \{P^n \mid P \text{ is a distribution on } \mathbb{R}\}$ and $X = (X_1, \dots, X_n) \sim P^n$.

However, we can use the "parametric" notation $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ WLOG (we can always take $\theta = P$, $\Theta = \mathcal{P}$).

1.2 Bayesian vs Frequentist Inference

So far we have assumed the data $X$ follow a distribution $P_\theta$ for a parameter $\theta$ we want to determine. Sometimes we will add the Bayesian assumption that $\theta$ itself is random, drawn from a known distribution $\Lambda$ called the prior.
This reduces the problem of inference to studying the conditional distribution of $\theta \mid X$.
Until we introduce Bayesian inference, however, we treat $\theta$ as a fixed value in $\Theta$.
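As an illustrative sketch of the Bayesian setup (the specific model is an assumption, not from the notes): with a Beta prior on $\theta$ and Bernoulli data, the conditional distribution $\theta \mid X$ is again a Beta distribution, so the prior-to-posterior update is a simple parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: Beta(2, 2) prior on theta, Bernoulli(theta) data.
a, b = 2.0, 2.0                            # prior Lambda = Beta(a, b)
theta_true = 0.7                           # unknown in practice; fixed here to simulate
x = rng.binomial(1, theta_true, size=50)   # observed data X = (X_1, ..., X_n)

# Conjugacy: theta | X  ~  Beta(a + sum(x), b + n - sum(x)).
a_post = a + x.sum()
b_post = b + len(x) - x.sum()
post_mean = a_post / (a_post + b_post)     # a natural point summary of theta | X
print(a_post, b_post, post_mean)
```

The posterior mean blends the prior mean with the sample frequency, and the data's influence grows with $n$.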

2 Estimation

We want to determine the value of a parameter in a parametric model.

Loss Function, Risk Function

The loss function $L(\theta, d)$ is the disutility of guessing $g(\theta) = d$. It is typically non-negative, with $L(\theta, g(\theta)) = 0$.

E.g., squared error loss $L(\theta, d) = (g(\theta) - d)^2$.

The risk function is the expected loss of an estimator: $R(\theta; \delta) = E_\theta[L(\theta, \delta(X))]$.

For squared error loss, the risk is called the mean squared error (MSE): $\mathrm{MSE}(\theta; \delta) = E_\theta[(\delta(X) - g(\theta))^2]$.
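The risk is an expectation over the data distribution, so it can be approximated by Monte Carlo. A minimal sketch (the normal model and sample-mean estimator are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Approximate MSE(theta; delta) = E_theta[(delta(X) - theta)^2] by simulation,
# for delta(X) = sample mean of n i.i.d. N(theta, 1) draws and g(theta) = theta.
theta, n, reps = 2.0, 25, 100_000
samples = rng.normal(theta, 1.0, size=(reps, n))
estimates = samples.mean(axis=1)               # delta(X) for each replication
mse_hat = np.mean((estimates - theta) ** 2)
print(mse_hat)                                 # close to the analytic value 1/n = 0.04
```

Here the analytic risk is $\mathrm{Var}(\bar{X}_n) = 1/n$, which the simulation recovers to Monte Carlo accuracy.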

In brief, we have two primary strategies to choose an estimator:

  1. Summarize the risk function by a scalar.
  2. Restrict attention to a smaller class of estimators.

2.1 Comparing Estimators

Inadmissible, Strictly Dominate

An estimator $\delta$ is inadmissible if there exists an estimator $\delta'$ with

  • $R(\theta; \delta') \le R(\theta; \delta)$ for all $\theta$;
  • $R(\theta; \delta') < R(\theta; \delta)$ for some $\theta$.

We say $\delta'$ strictly dominates $\delta$.
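A concrete (assumed, not from the notes) instance of strict dominance: for $X_1, \dots, X_n$ i.i.d. $N(\theta, 1)$ under squared error, $\delta(X) = X_1$ has risk $1$ while $\delta'(X) = \bar{X}_n$ has risk $1/n$, at every $\theta$, so the sample mean strictly dominates the first observation.

```python
import numpy as np

rng = np.random.default_rng(2)

# For X_1,...,X_n i.i.d. N(theta, 1): risk of delta(X) = X_1 is 1,
# risk of delta'(X) = mean(X) is 1/n, uniformly in theta.
n, reps = 10, 50_000
for theta in (-3.0, 0.0, 5.0):                         # risk is flat in theta
    x = rng.normal(theta, 1.0, size=(reps, n))
    risk_first = np.mean((x[:, 0] - theta) ** 2)       # near 1
    risk_mean = np.mean((x.mean(axis=1) - theta) ** 2) # near 1/n = 0.1
    assert risk_mean < risk_first                      # dominance at this theta
    print(theta, round(risk_first, 3), round(risk_mean, 3))
```

Since the inequality is strict at every $\theta$, the estimator $X_1$ is inadmissible.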

2.2 Resolving Ambiguity

No estimator uniformly attains the smallest risk among all estimators. For instance, the trivial constant estimator $\delta(X) \equiv \frac{1}{2}$ performs best when $\theta$ is actually $\frac{1}{2}$. There are two main strategies to resolve this ambiguity.
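The ambiguity can be made concrete by plotting the two risk curves (a sketch assuming Bernoulli data, which fits the constant guess $\frac{1}{2}$): the constant estimator and the sample mean each win on part of the parameter space, so neither dominates.

```python
import numpy as np

# Squared-error risks for Bernoulli(theta) data, n = 10 (illustrative setup).
# Constant guess delta(X) = 1/2 has risk (theta - 1/2)^2; the sample mean
# has risk Var = theta * (1 - theta) / n.  Neither is uniformly better.
n = 10
thetas = np.linspace(0.0, 1.0, 101)
risk_const = (thetas - 0.5) ** 2
risk_mean = thetas * (1 - thetas) / n

# Near theta = 1/2 the constant estimator wins; near the endpoints the mean wins.
i_mid, i_end = 50, 0          # theta = 0.5 and theta = 0.0
print(risk_const[i_mid], risk_mean[i_mid])   # constant wins here
print(risk_const[i_end], risk_mean[i_end])   # sample mean wins here
```

The risk curves cross, which is exactly why a further criterion (a scalar summary, or a restricted class) is needed to pick an estimator.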

Summarize the risk function by a scalar, e.g., its maximum over $\theta$ (leading to minimax estimation) or its average under a prior (the Bayes risk).

Restrict the choice of estimators
Unbiased estimation: we can demand that an estimator satisfy $E_\theta[\delta_0(X)] = g(\theta)$ for all $\theta \in \Theta$.
Under unbiasedness, we can cleanly define an optimal estimator, called the UMVU (uniformly minimum variance unbiased) estimator. In the above example, $\delta_0(X) = \bar{X}_n$ is in fact the UMVU estimator.
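The unbiasedness condition can be checked by simulation. A minimal sketch, assuming Bernoulli data and $g(\theta) = \theta$ (an illustrative choice): the sample mean's expectation matches $\theta$ at every parameter value tried.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check E_theta[delta_0(X)] = g(theta) for delta_0(X) = sample mean,
# g(theta) = theta, with X_1,...,X_n i.i.d. Bernoulli(theta).
n, reps = 20, 200_000
for theta in (0.1, 0.5, 0.9):
    x = rng.binomial(1, theta, size=(reps, n))
    est_mean = x.mean(axis=1).mean()          # approximates E_theta[delta_0(X)]
    assert abs(est_mean - theta) < 0.005      # unbiased: matches theta each time
    print(theta, est_mean)
```

Unbiasedness must hold for every $\theta \in \Theta$, which is why the check sweeps several parameter values rather than a single one.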